Making Differential Privacy Easier to Use for Data Controllers and Data Analysts using a Privacy Risk Indicator and an Escrow-Based Platform
Differential privacy (DP) enables private data analysis but is hard to use in
practice. For data controllers who decide what output to release, choosing the
amount of noise to add to the output is a non-trivial task because of the
difficulty of interpreting the privacy parameter ε. For data analysts
who submit queries, it is hard to understand the impact of the noise introduced
by DP on their tasks.
To address these two challenges: 1) we define a privacy risk indicator that
indicates the impact of choosing ε on individuals' privacy and use
that to design an algorithm that chooses ε automatically; 2) we
introduce a utility signaling protocol that helps analysts interpret the impact
of DP on their downstream tasks. We implement the algorithm and the protocol
inside a new platform built on top of a data escrow, which allows the
controller to control the data flow and achieve trustworthiness while
maintaining high performance. We demonstrate our contributions through an
IRB-approved user study, extensive experimental evaluations, and comparison
with other DP platforms. All in all, our work contributes to making DP easier
to use by lowering adoption barriers.
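The tradeoff the abstract describes — smaller ε means stronger privacy but noisier output — can be illustrated with the textbook Laplace mechanism. This is a generic DP sketch, not the paper's risk-indicator algorithm; the function name is illustrative:

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, rng=None) -> float:
    """Release a count under epsilon-DP via the Laplace mechanism.

    A count query has sensitivity 1, so adding noise drawn from
    Laplace(scale=1/epsilon) suffices: a smaller epsilon gives
    stronger privacy but a noisier released value.
    """
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon -> larger expected noise magnitude.
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_count(1000, eps, np.random.default_rng(0))
    print(f"epsilon={eps}: noisy count = {noisy:.1f}")
```

Interpreting what a given ε means for a concrete individual is exactly the gap the paper's privacy risk indicator aims to close for data controllers.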
Solo: Data Discovery Using Natural Language Questions Via A Self-Supervised Approach
Most deployed data discovery systems, such as Google Datasets and open data
portals only support keyword search. Keyword search is geared towards general
audiences but limits the types of queries the systems can answer. We propose a
new system that lets users write natural language questions directly. A major
barrier to using such a learned data discovery system is that it needs
expensive-to-collect training data, which limits its utility. In this paper,
we introduce a self-supervised approach to assemble training datasets and train
learned discovery systems without human intervention. It requires addressing
several challenges, including the design of self-supervised strategies for data
discovery, table representation strategies to feed to the models, and relevance
models that work well with the synthetically generated questions. We combine
all the above contributions into a system, Solo, that solves the problem end to
end. The evaluation results demonstrate the new techniques outperform
state-of-the-art approaches on well-known benchmarks. All in all, the technique
is a stepping stone towards building learned discovery systems. The code is
open-sourced at https://github.com/TheDataStation/solo
Comment: To appear at SIGMOD 202
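The self-supervised idea — assembling (question, table) training pairs without human labels — can be sketched as follows. This is a hypothetical template-filling strategy for illustration only, not Solo's actual generation pipeline; all names are invented:

```python
import random

def synth_questions(table_name: str, columns: list, rows: list,
                    n: int = 3, seed: int = 0) -> list:
    """Assemble (question, table) training pairs with no human labels.

    A toy self-supervised strategy: sample cell values from the table
    and slot them into question templates, so the table itself serves
    as the positive (relevant) answer for each generated question.
    """
    rng = random.Random(seed)
    templates = [
        "which dataset contains {col} {val}?",
        "find tables about {col} such as {val}",
    ]
    pairs = []
    for _ in range(n):
        row = rng.choice(rows)
        col = rng.choice(columns)
        question = rng.choice(templates).format(col=col, val=row[col])
        pairs.append((question, table_name))
    return pairs

pairs = synth_questions("city_budgets", ["city", "year"],
                        [{"city": "Chicago", "year": 2021},
                         {"city": "Boston", "year": 2020}])
for q, t in pairs:
    print(q, "->", t)
```

A relevance model can then be trained on such pairs; as the abstract notes, making the model work well with synthetically generated questions is one of the core challenges.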
Stateful data-parallel processing
Democratisation of data means that more people than ever are involved in the data analysis process. This is beneficial—it brings domain-specific knowledge from broad fields—but data scientists do not have adequate tools to write algorithms and execute them at scale. The processing models of current data-parallel processing systems, designed for scalability and fault tolerance, are stateless. Stateless processing facilitates capturing parallelisation opportunities and hides the complexity of fault tolerance. However, data scientists want to write stateful programs—with explicit state that they can update, such as matrices in machine learning algorithms—and are used to imperative-style languages. These programs struggle to execute with high performance in stateless data-parallel systems.
Representing state explicitly makes data-parallel processing at scale challenging. To achieve scalability, state must be distributed and coordinated across machines. In the event of failures, state must be recovered to provide correct results. We introduce stateful data-parallel processing that addresses these challenges by: (i) representing state as a first-class citizen so that a system can manipulate it; (ii) introducing two distributed mutable state abstractions for scalability; and (iii) adopting an integrated approach to scale-out and fault tolerance that recovers large state—spanning the memory of multiple machines. To support imperative-style programs, a static analysis tool analyses Java programs that manipulate state and translates them to a representation that can execute on SEEP, an implementation of a stateful data-parallel processing model. SEEP is evaluated with stateful Big Data applications and shows comparable or better performance than state-of-the-art stateless systems.
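The "state as a first-class citizen" idea can be sketched in miniature: state that the runtime itself can partition for scale-out and snapshot for recovery. This is an illustrative toy in Python, not SEEP's actual (Java-based) API; all names are invented:

```python
class PartitionedState:
    """A toy distributed-mutable-state abstraction.

    State is a first-class object the runtime can manipulate: it is
    hash-partitioned by key so partitions could live on different
    machines, and it can be checkpointed per partition so that only
    a failed partition's snapshot needs restoring.
    """
    def __init__(self, num_partitions: int = 4):
        self.partitions = [dict() for _ in range(num_partitions)]

    def _part(self, key):
        return self.partitions[hash(key) % len(self.partitions)]

    def update(self, key, fn, default=0):
        # Imperative-style in-place update, routed to the owning partition.
        part = self._part(key)
        part[key] = fn(part.get(key, default))

    def get(self, key, default=0):
        return self._part(key).get(key, default)

    def checkpoint(self):
        # Snapshot each partition independently for failure recovery.
        return [dict(p) for p in self.partitions]

state = PartitionedState()
for word in ["a", "b", "a"]:
    state.update(word, lambda c: c + 1)
print(state.get("a"))  # 2
```

In a real system the partitions would sit on different workers and checkpoints would go to durable storage; the point here is only that explicit, system-visible state enables both scale-out and recovery.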
What Does it Take to be a Social Agent?
The aim of this paper is to present a philosophically inspired list of minimal requirements for social agency that may serve as a guideline for social robotics. Such a list does not aim at detailing the cognitive processes behind sociality but at providing an implementation-free characterization of the capacities and skills associated with sociality. We employ the notion of intentional stance as a methodological ground to study intentional agency and extend it into a social stance that takes into account social features of behavior. We discuss the basic requirements of sociality and different ways to understand them, and suggest some potential benefits of understanding them in an instrumentalist way in the context of social robotics.
METAM: Goal-Oriented Data Discovery
Data is a central component of machine learning and causal inference tasks.
The availability of large amounts of data from sources such as open data
repositories, data lakes and data marketplaces creates an opportunity to
augment data and boost those tasks' performance. However, augmentation
techniques rely on a user manually discovering and shortlisting useful
candidate augmentations. Existing solutions do not leverage the synergy between
discovery and augmentation, and thus under-exploit the available data.
In this paper, we introduce METAM, a novel goal-oriented framework that
queries the downstream task with a candidate dataset, forming a feedback loop
that automatically steers the discovery and augmentation process. To select
candidates efficiently, METAM leverages properties of i) the data, ii) the
utility function, and iii) the solution set size. We show METAM's theoretical
guarantees and demonstrate them empirically on a broad set of tasks. All in all, we
demonstrate the promise of goal-oriented data discovery to modern data science
applications.
Comment: ICDE 2023 paper
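The feedback loop the abstract describes — query the downstream task with a candidate dataset, observe utility, steer the next selection — can be sketched as a simple greedy loop. This is a minimal sketch for intuition, not METAM's actual algorithm (which exploits additional structure for efficiency); all names are invented:

```python
def goal_oriented_augment(base, candidates, train_and_score, budget=10):
    """Greedy goal-oriented augmentation loop (illustrative sketch).

    Repeatedly queries the downstream task (train_and_score) with each
    remaining candidate dataset and keeps the one that improves task
    utility the most, stopping when no candidate helps.
    """
    selected, best = [], train_and_score(base, [])
    for _ in range(budget):
        gains = [(train_and_score(base, selected + [c]) - best, c)
                 for c in candidates if c not in selected]
        if not gains:
            break
        gain, c = max(gains, key=lambda g: g[0])
        if gain <= 0:          # feedback loop: stop when nothing helps
            break
        selected.append(c)
        best += gain
    return selected, best

# Toy downstream task: utility = how many goal columns the augmented data covers.
goal = {"income", "age", "zip"}
def score(base, sel):
    covered = set(base)
    for s in sel:
        covered |= s
    return len(goal & covered)

chosen, utility = goal_oriented_augment({"income"}, [{"age"}, {"zip"}, {"color"}], score)
print(chosen, utility)  # keeps {"age"} and {"zip"}; {"color"} adds no utility
```

Each call to `train_and_score` here stands in for actually augmenting the data and re-evaluating the ML or causal-inference task, which is why selecting candidates efficiently matters.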
Niffler: A Reference Architecture and System Implementation for View Discovery over Pathless Table Collections by Example
Identifying a project-join view (PJ-view) over collections of tables is the
first step of many data management projects, e.g., assembling a dataset to feed
into a business intelligence tool, creating a training dataset to fit a machine
learning model, and more. When the table collections are large and lack join
information—such as when combining databases or on data lakes—query by
example (QBE) systems can help identify relevant data, but they are designed
under the assumption that join information is available in the schema, and do
not perform well on pathless table collections that do not have join path
information.
We present a reference architecture that explicitly divides the end-to-end
problem of discovering PJ-views over pathless table collections into a human
and a technical problem. We then present Niffler, a system built to address the
technical problem. We introduce algorithms for the main components of Niffler,
including a signal-generation component that reduces the number of candidate
views, which can be large due to errors and ambiguity in both the data and the
input queries. We evaluate Niffler on real datasets to demonstrate the
effectiveness of the new engine in discovering PJ-views over pathless table
collections.
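On a pathless table collection, one basic signal for proposing join paths is value overlap between columns. The sketch below is a toy stand-in for this kind of signal, not Niffler's actual algorithms; names and the Jaccard threshold are illustrative:

```python
def candidate_joins(tables: dict, threshold: float = 0.5):
    """Propose join candidates over a 'pathless' table collection.

    Two columns from different tables become a join candidate when
    the Jaccard similarity of their value sets meets the threshold,
    approximating the missing join-path information.
    """
    cols = [(t, c, set(vals)) for t, tbl in tables.items()
            for c, vals in tbl.items()]
    joins = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            t1, c1, v1 = cols[i]
            t2, c2, v2 = cols[j]
            if t1 == t2:
                continue  # only cross-table joins are interesting here
            jac = len(v1 & v2) / max(len(v1 | v2), 1)
            if jac >= threshold:
                joins.append((f"{t1}.{c1}", f"{t2}.{c2}", round(jac, 2)))
    return joins

tables = {
    "orders":    {"cust_id": [1, 2, 3], "total": [9, 5, 7]},
    "customers": {"id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]},
}
print(candidate_joins(tables))  # [('orders.cust_id', 'customers.id', 0.75)]
```

Real collections produce many spurious candidates from dirty or ambiguous data, which is why a signal-generation component that prunes candidate views matters.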